There are a lot of factors involved in determining how you can find your way around and avoid delays, bad weather, dangers and expenses. In this talk, I will focus on public transport in the largest transit system in the United States, the MTA, which is centered on New York City. Because it relies on public data feeds, this approach can be extended to most city and metropolitan areas around the world. As a personal example, I live in New Jersey, and this is an extremely useful application of open source and public data.

Once I am notified that I need to travel to Manhattan, I need to start my data streams flowing. Most of the data sources are REST feeds that are ingested by Apache NiFi, which transforms, converts, enriches and finalizes the data for use in streaming tables with Flink SQL, while keeping the same contract with Kafka consumers, Iceberg tables and other users of the data. I do not need many user interfaces to interact with the system, as I want my final decision sent to me in a Slack message. Along the way, data will be visible in NiFi lineage, Kafka topic views, Flink SQL output and REST output.


Apache NiFi, Apache Kafka, Apache OpenNLP, Apache Tika, Apache Flink, Apache Avro, Apache Parquet,
Apache Iceberg.


https://github.com/[illegible]MTA/tree/main
Trains, Planes and Automobiles

* Local weather conditions - XML, JSON, RSS
* Mass transit status & alerts - XML, JSON, RSS
* Regional highways & tunnels - GeoRSS, XML, ProtoBuf, JSON
* Local social media - JSON
* ADS-B Plane Data - JSON
* Local air quality - JSON
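
Because the sources above arrive in several formats, a first step is usually to bring them into one record shape. The sketch below is a minimal illustration using only the Python standard library; the function names, the `source` tag and the sample payloads are made up for this example and are not taken from the talk's flows, where Apache NiFi does this work.

```python
import json
import xml.etree.ElementTree as ET


def parse_json_feed(payload):
    """Return the records of a JSON payload as a list of dicts."""
    data = json.loads(payload)
    return data if isinstance(data, list) else [data]


def parse_rss_feed(payload):
    """Flatten each RSS <item> into a dict of its child element texts."""
    root = ET.fromstring(payload)
    records = []
    for item in root.iter("item"):
        records.append({child.tag: (child.text or "").strip() for child in item})
    return records


# Hypothetical registry: one parser per feed format.
PARSERS = {"json": parse_json_feed, "rss": parse_rss_feed}


def normalize(source, fmt, payload):
    """Parse a payload and tag every record with the feed it came from."""
    records = PARSERS[fmt](payload)
    return [{"source": source, **record} for record in records]


if __name__ == "__main__":
    # Made-up sample payloads, for illustration only.
    rss = "<rss><channel><item><title>Delays on the A line</title></item></channel></rss>"
    print(normalize("transit_alerts", "rss", rss))
    print(normalize("air_quality", "json", '{"aqi": 42}'))
```

Keeping one parser per format behind a single `normalize` call means downstream steps only ever see a list of dicts, whichever feed produced them.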
REAL-TIME REQUIRES A PLATFORM
INGEST OF ALL TRANSIT DATA

Run collection and streaming on any cloud, server, container or VM.

Cloud: Microsoft Azure

Data Sources: GTFS Realtime (Vehicle Positions, Updates, Alerts)
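
A GTFS Realtime feed is a protobuf message whose entities each carry a trip update, a vehicle position or an alert. The sketch below assumes the feed has already been decoded into plain dicts that keep the `.proto` field names (`entity`, `trip_update`, `vehicle`, `alert`); the helper names and the sample feed are hypothetical.

```python
from collections import Counter

# Field names follow the FeedEntity message in the GTFS Realtime .proto.
ENTITY_KINDS = {
    "trip_update": "update",
    "vehicle": "vehicle_position",
    "alert": "alert",
}


def classify_entity(entity):
    """Return which of the three GTFS Realtime payload kinds an entity carries."""
    for field_name, kind in ENTITY_KINDS.items():
        if field_name in entity:
            return kind
    return "unknown"


def summarize_feed(feed):
    """Count the entities of each kind in a decoded feed message."""
    return Counter(classify_entity(e) for e in feed.get("entity", []))


if __name__ == "__main__":
    # Made-up decoded feed, for illustration only.
    feed = {
        "header": {"gtfs_realtime_version": "2.0"},
        "entity": [
            {"id": "1", "vehicle": {"position": {"latitude": 40.75, "longitude": -73.99}}},
            {"id": "2", "trip_update": {"trip": {"route_id": "A"}}},
            {"id": "3", "alert": {"header_text": "Service change"}},
        ],
    }
    print(summarize_feed(feed))
```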
APACHE KAFKA

[Diagram: MiNiFi (data collection) -> Apache Kafka (ingest gateway) -> Apache NiFi (data flow apps) -> Apache Kafka (Kafka cluster, geo 1) -> Apache Flink (streaming). The US-West, US-Central and US-East fleets feed IoT gateways (topic gateway-west-raw-sensors); events acquired from Kafka are consumed by stream analytics apps, Spark Structured Streaming and microservices.]
APACHE FLINK

CREATE TABLE `ssb`.`Meetups`.`hfbloom` (
  `generated_text` VARCHAR(2147483647),
  `ts` VARCHAR(2147483647),
  `x_compute_type` VARCHAR(2147483647),
  `inputs` VARCHAR(2147483647),
  `x_compute_time` VARCHAR(2147483647),
  `x_inference_time` VARCHAR(2147483647),
  `uuid` VARCHAR(2147483647),
  `x_time_per_token` VARCHAR(2147483647),
  `x_compute_characters` VARCHAR(2147483647),
  -- Event time is taken from the Kafka record timestamp.
  `eventTimeStamp` TIMESTAMP(3) WITH LOCAL TIME ZONE METADATA FROM 'timestamp',
  -- Tolerate events arriving up to 3 seconds out of order.
  WATERMARK FOR `eventTimeStamp` AS `eventTimeStamp` - INTERVAL '3' SECOND
) WITH (
  'scan.startup.mode' = 'group-offsets',
  'properties.request.timeout.ms' = '120000',
  'properties.auto.offset.reset' = 'earliest',
  'format' = 'json',
  'properties.bootstrap.servers' = 'kafka:9092',
  'connector' = 'kafka',
  'properties.transaction.timeout.ms' = '900000',
  'topic' = 'hfbloom',
  'properties.group.id' = 'llmBloomProps'
)
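
The WATERMARK clause in the table definition tells Flink how far out of order events may arrive. As a rough, simplified illustration of that idea (not Flink's implementation), the sketch below keeps a watermark three seconds behind the largest event time seen so far and flags any event at or behind it as late. The function name and the sample timestamps are made up for this example.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(seconds=3)  # mirrors INTERVAL '3' SECOND


def split_late_events(event_times):
    """Return (on_time, late) lists for events processed in arrival order."""
    watermark = datetime.min
    on_time, late = [], []
    for ts in event_times:
        if ts <= watermark:
            late.append(ts)
        else:
            on_time.append(ts)
        # The watermark trails the largest event time and never moves backward.
        watermark = max(watermark, ts - ALLOWED_LATENESS)
    return on_time, late


if __name__ == "__main__":
    base = datetime(2024, 1, 1, 12, 0, 0)  # arbitrary sample time
    arrivals = [base + timedelta(seconds=s) for s in (0, 10, 8, 5)]
    on_time, late = split_late_events(arrivals)
    print("on time:", on_time)
    print("late:", late)
```

In Flink SQL, a row being behind the watermark mainly matters to time-based operators such as windows, which may discard it.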
RECORD-ORIENTED DATA WITH NIFI

* Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML
* Record Writers - Avro, CSV, FreeFormText, JSON, Parquet, Scripted, XML
* Record Reader and Writer support referencing a schema registry for retrieving schemas when necessary.
* Enable processors that accept any data format without having to worry about the parsing and serialization logic.
* Allows us to keep FlowFiles larger, each consisting of multiple records, which results in far better performance.

[Screenshot: Configure Processor dialog with the Record Reader property set to CSVReader]
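
The reader/writer split described above can be illustrated outside NiFi with a few lines of plain Python. This is only a sketch of the pattern, not NiFi's API: the function names, the sample CSV and the transform are made up. The point is that the per-record logic never touches parsing or serialization, and that one payload carries many records.

```python
import csv
import io
import json


def read_csv_records(text):
    """Reader: yield one dict per CSV row, using the header row as field names."""
    yield from csv.DictReader(io.StringIO(text))


def write_json_records(records):
    """Writer: serialize records as newline-delimited JSON."""
    return "\n".join(json.dumps(record) for record in records)


def convert(text, reader, writer, transform=lambda record: record):
    """Apply a per-record transform without knowing the input or output format."""
    return writer(transform(record) for record in reader(text))


if __name__ == "__main__":
    # Made-up multi-record payload, for illustration only.
    payload = "route,status\nA,delayed\nC,on time\n"
    print(convert(payload, read_csv_records, write_json_records))
```

Swapping in a different reader or writer changes the formats without touching `convert` or the transform.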


PROVENANCE

[Screenshot: NiFi provenance search filtered by component name ConsumeKafka, displaying 13 of 104 events. The visible rows are 44-byte RECEIVE events from the ConsumeKafka processor, dated 11/15/2016.]

* Tracks data at each point as it flows through the system
* Handles fan-in/fan-out, i.e. merging and splitting data
* View attributes and content at given points in time
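
To make the fan-in/fan-out idea concrete, here is a toy provenance log in plain Python. It is a sketch of the concept only and does not use NiFi's provenance API: the class names and the `lineage` helper are hypothetical, while RECEIVE, FORK and JOIN are event types NiFi itself reports.

```python
import uuid
from dataclasses import dataclass


@dataclass
class ProvenanceEvent:
    event_type: str   # e.g. RECEIVE, FORK, JOIN
    component: str    # processor that produced the event
    parents: list     # FlowFile ids consumed by the event
    children: list    # FlowFile ids produced by the event


class ProvenanceLog:
    def __init__(self):
        self.events = []

    def record(self, event_type, component, parents=(), n_children=1):
        """Store an event and return the ids of the FlowFiles it produced."""
        children = [str(uuid.uuid4()) for _ in range(n_children)]
        self.events.append(ProvenanceEvent(event_type, component, list(parents), children))
        return children

    def lineage(self, flowfile_id):
        """Walk parent links from a FlowFile back to the event(s) that received it."""
        chain, seen, frontier = [], set(), [flowfile_id]
        while frontier:
            current = frontier.pop()
            if current in seen:
                continue
            seen.add(current)
            for event in self.events:
                if current in event.children and event not in chain:
                    chain.append(event)
                    frontier.extend(event.parents)
        return chain


if __name__ == "__main__":
    log = ProvenanceLog()
    [raw] = log.record("RECEIVE", "ConsumeKafka")
    parts = log.record("FORK", "SplitRecord", [raw], n_children=2)   # fan-out
    [merged] = log.record("JOIN", "MergeRecord", parts)              # fan-in
    for event in log.lineage(merged):
        print(event.event_type, event.component)
```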
PYTHON CUSTOM PROCESSORS

[Screenshot: NiFi canvas with a GetFile processor connected through a success relationship to a DetectObjectInImage processor]

import cv2
import numpy as np
import json
from nifiapi.properties import PropertyDescriptor
from nifiapi.properties import ResourceDefinition
from nifiapi.flowfiletransform import FlowFileTransformResult

SCALE_FACTOR = 0.00392
NMS_THRESHOLD = 0.4  # non-maximum suppression threshold
CONFIDENCE_THRESHOLD = 0.5

class DetectObjectInImage:
    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '0.0.1-SNAPSHOT'
        dependencies = ['numpy >= 1.23.5', 'opencv-python >= 4.6']

    def __init__(self, jvm=None, **kwargs):
        self.jvm = jvm

        # Build Property Descriptors
        self.model_file = PropertyDescriptor(
            name = 'Model File',
            description = 'The binary file containing the trained Deep Neural Network weights. Supports Caffe (*.caffemodel), TensorFlow (*.pb), Torch (*.t7, *.net), Darknet (*.weights), ' +
                          'DLDT (*.bin), and ONNX (*.onnx)',
            required = True,
            resource_definition = ResourceDefinition(allow_file = True)
        )
        self.config_file = PropertyDescriptor(
            name = 'Network Config File',
            description = 'The text file containing the Network configuration. Supports Caffe (*.prototxt), TensorFlow (*.pbtxt), Darknet (*.cfg), and DLDT (*.xml)',
            required = False,
            resource_definition = ResourceDefinition(allow_file = True)
        )
        self.class_name_file = PropertyDescriptor(
            name = 'Class Names File',
            description = 'A text file containing the names of the classes that may be detected by the model. Expected format is one class name per line, new-line terminated.',
            required = True,
            resource_definition = ResourceDefinition(allow_file = True)
        )

        self.descriptors = [self.model_file, self.config_file, self.class_name_file]

    def getPropertyDescriptors(self):
        return self.descriptors

    def onScheduled(self, context):
        # read class names from text file
        class_name_file = context.getProperty(self.class_name_file.name).getValue()

https://github.com/apache/nifi/blob/614947e4ac6798ad80817e82514c3' 


asciidoc/python-developer- 


Future of Data - NYC / Princeton + Virtual 


AN OPEN SOURCE COMMUNITY


https://www.meetup.com/futureofdata-princeton/ 
https://www.meetup.com/futureofdata-newyork/ 


From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
This week in Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Python, Java, AI, ML, LLM and Open Source friends.